fix: clean up Temporal server-side versioning data on TWD deletion #240

Open
anujagrawal380 wants to merge 6 commits into temporalio:main from anujagrawal380:fix/twd-leaves-stale-versioning-data

@anujagrawal380 anujagrawal380 commented Mar 24, 2026

  • Add a finalizer to TemporalWorkerDeployment to run Temporal server-side cleanup before K8s deletion
  • Add a finalizer to TemporalConnection to prevent it from being deleted while any TWD still references it
  • On TWD deletion, set current version to unversioned, clear ramping version, and delete registered versions

Problem

When a TemporalWorkerDeployment resource is deleted (e.g., when switching back to plain Deployments), the Temporal server retains the build ID routing configuration. The matching service continues routing new tasks to the deleted build ID's physical queue, while unversioned workers poll a different physical queue. Tasks sit in Scheduled state indefinitely with no errors.

A secondary race condition exists: Helm deletes both the TemporalConnection and TWD in the same upgrade. Without the connection, the controller cannot talk to Temporal to clean up. This is solved by adding a finalizer to the TemporalConnection that blocks its deletion until all referencing TWDs are gone.

Changes

internal/controller/worker_controller.go:

TWD finalizer (temporal.io/worker-deployment-cleanup):

  • Added to all TWD resources during normal reconciliation
  • On deletion, triggers handleDeletion() which:
    1. Sets the current version to unversioned (BuildID: "") -- the critical step that unblocks task dispatch
    2. Clears any ramping version
    3. Deletes all registered versions with SkipDrainage: true
    4. Attempts to delete the deployment record itself
    5. Removes the connection finalizer if no other TWDs reference it
    6. Removes its own finalizer, allowing K8s to complete deletion
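
As a rough illustration of steps 1 to 4 above, here is a minimal sketch of the server-side ordering. The `deploymentCleaner` interface, its method names, and `fakeCleaner` are hypothetical stand-ins invented for this example; they are not the Temporal SDK's API and not the code in this PR. The sketch assumes every failure is returned to the caller so the reconcile can be requeued with the finalizer still in place.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"strings"
)

// deploymentCleaner is a hypothetical stand-in for the Temporal client calls
// the controller makes; the real SDK types and method names differ.
type deploymentCleaner interface {
	SetCurrentVersion(ctx context.Context, buildID string) error
	ClearRampingVersion(ctx context.Context) error
	ListVersions(ctx context.Context) ([]string, error)
	DeleteVersion(ctx context.Context, buildID string, skipDrainage bool) error
	DeleteDeployment(ctx context.Context) error
}

// cleanupDeployment runs steps 1-4 in order and returns any blocking error,
// so the caller can requeue and leave the finalizer in place.
func cleanupDeployment(ctx context.Context, c deploymentCleaner) error {
	// Step 1: an empty build ID routes new tasks to unversioned workers.
	if err := c.SetCurrentVersion(ctx, ""); err != nil {
		return fmt.Errorf("set current version to unversioned: %w", err)
	}
	// Step 2: stop ramping traffic to any version.
	if err := c.ClearRampingVersion(ctx); err != nil {
		return fmt.Errorf("clear ramping version: %w", err)
	}
	// Step 3: delete every registered version, skipping drainage.
	versions, err := c.ListVersions(ctx)
	if err != nil {
		return fmt.Errorf("list versions: %w", err)
	}
	var errs []error
	for _, v := range versions {
		if err := c.DeleteVersion(ctx, v, true); err != nil {
			errs = append(errs, fmt.Errorf("delete version %q: %w", v, err))
		}
	}
	if err := errors.Join(errs...); err != nil {
		return err
	}
	// Step 4: delete the deployment record once no versions remain.
	return c.DeleteDeployment(ctx)
}

// fakeCleaner records calls so the ordering can be inspected without a server.
type fakeCleaner struct {
	versions []string
	calls    []string
}

func (f *fakeCleaner) SetCurrentVersion(_ context.Context, buildID string) error {
	f.calls = append(f.calls, fmt.Sprintf("set-current(%q)", buildID))
	return nil
}

func (f *fakeCleaner) ClearRampingVersion(context.Context) error {
	f.calls = append(f.calls, "clear-ramping")
	return nil
}

func (f *fakeCleaner) ListVersions(context.Context) ([]string, error) {
	return f.versions, nil
}

func (f *fakeCleaner) DeleteVersion(_ context.Context, buildID string, _ bool) error {
	f.calls = append(f.calls, "delete-version("+buildID+")")
	return nil
}

func (f *fakeCleaner) DeleteDeployment(context.Context) error {
	f.calls = append(f.calls, "delete-deployment")
	return nil
}

// demo runs the cleanup against the fake and returns the recorded call order.
func demo() string {
	f := &fakeCleaner{versions: []string{"v1", "v2"}}
	if err := cleanupDeployment(context.Background(), f); err != nil {
		return "error: " + err.Error()
	}
	return strings.Join(f.calls, " -> ")
}

func main() {
	fmt.Println(demo())
}
```

Steps 5 and 6 (removing the two finalizers) are left out on purpose: in this sketch they would run only after `cleanupDeployment` returns nil.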

TemporalConnection finalizer (temporal.io/connection-in-use):

  • Added to the TemporalConnection during normal TWD reconciliation via ensureConnectionFinalizer()
  • Prevents the connection from being deleted while any TWD still references it
  • Removed by removeConnectionFinalizerIfUnused() during TWD deletion, after checking no other TWDs in the same namespace reference the connection
  • Guarantees the connection is always available during TWD cleanup -- no race condition with Helm deleting both resources simultaneously
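
The "no other TWDs in the same namespace reference the connection" check can be sketched as below. The `workerDeployment` struct and both helper functions are hypothetical simplifications for illustration; the PR's `removeConnectionFinalizerIfUnused()` works on real Kubernetes objects and may differ in detail. Only the finalizer name is taken from the description above.

```go
package main

import "fmt"

// connectionFinalizer is the finalizer name given in the PR description.
const connectionFinalizer = "temporal.io/connection-in-use"

// workerDeployment is a hypothetical, trimmed-down view of a TWD; the real
// type is a Kubernetes object with metadata and a spec.
type workerDeployment struct {
	Name           string
	Namespace      string
	ConnectionName string
}

// connectionStillReferenced reports whether any TWD other than the one being
// deleted, in the same namespace, references the same connection.
func connectionStillReferenced(all []workerDeployment, deleting workerDeployment) bool {
	for _, twd := range all {
		if twd.Namespace != deleting.Namespace || twd.Name == deleting.Name {
			continue
		}
		if twd.ConnectionName == deleting.ConnectionName {
			return true
		}
	}
	return false
}

// removeFinalizer returns a copy of finalizers without the named entry.
func removeFinalizer(finalizers []string, name string) []string {
	kept := make([]string, 0, len(finalizers))
	for _, f := range finalizers {
		if f != name {
			kept = append(kept, f)
		}
	}
	return kept
}

func main() {
	twds := []workerDeployment{
		{Name: "orders", Namespace: "prod", ConnectionName: "temporal"},
		{Name: "billing", Namespace: "prod", ConnectionName: "temporal"},
	}
	finalizers := []string{connectionFinalizer}

	// Deleting "orders" while "billing" still exists keeps the finalizer.
	if !connectionStillReferenced(twds, twds[0]) {
		finalizers = removeFinalizer(finalizers, connectionFinalizer)
	}
	fmt.Println(finalizers)

	// Deleting the last referencing TWD releases the connection.
	if !connectionStillReferenced(twds[1:], twds[1]) {
		finalizers = removeFinalizer(finalizers, connectionFinalizer)
	}
	fmt.Println(finalizers)
}
```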

RBAC updates:

  • Added update;patch verbs for temporalconnections (was get;list;watch)
  • Added update verb for temporalconnections/finalizers
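
For reference, the RBAC changes listed above would typically be expressed as kubebuilder markers along these lines. This is a sketch: the API group `temporal.io` and the package name are assumptions made for this example, so check them against the existing markers in the repository.

```go
package controller

// Assumed API group: temporal.io. Verbs are taken from the list above.

// +kubebuilder:rbac:groups=temporal.io,resources=temporalconnections,verbs=get;list;watch;update;patch
// +kubebuilder:rbac:groups=temporal.io,resources=temporalconnections/finalizers,verbs=update
```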

Deletion flow

Helm upgrade (TWD disabled)
  |
  v
Helm deletes TWD CRD + TemporalConnection CRD simultaneously
  |
  +--> TemporalConnection: has finalizer, K8s sets DeletionTimestamp but blocks deletion
  |
  +--> TWD: has finalizer, K8s sets DeletionTimestamp, triggers Reconcile
         |
         v
       handleDeletion() runs:
         1. Fetches TemporalConnection (guaranteed to exist via finalizer)
         2. Connects to Temporal server
         3. Sets current version to unversioned
         4. Deletes versions
         5. Removes connection finalizer (no other TWDs reference it)
         6. Removes TWD finalizer
         |
         v
       TWD deleted by K8s
         |
         v
       TemporalConnection: no more finalizers, deleted by K8s

Issue #55
Closes #166

@anujagrawal380 anujagrawal380 requested review from a team and jlegrone as code owners March 24, 2026 18:18
CLAassistant commented Mar 24, 2026

CLA assistant check
All committers have signed the CLA.

@anujagrawal380 (Author)

PTAL @carlydf

@jaypipes jaypipes left a comment

@anujagrawal380 awesome contribution, thank you so much for this PR! I have a couple of really minor comments below, but overall excellent work.

Comment thread internal/controller/worker_controller.go Outdated
Comment thread internal/controller/worker_controller.go
@anujagrawal380 (Author)

> @anujagrawal380 awesome contribution, thank you so much for this PR! I have a couple of really minor comments below, but overall excellent work.

Thanks, resolved both the comments!

@jaypipes jaypipes left a comment

rock on :) nice work on this @anujagrawal380!

@carlydf (Collaborator) left a comment

we need integration tests for this before merging to main / including it in a release

@anujagrawal380 (Author)

> we need integration tests for this before merging to main / including it in a release

@carlydf Added the integration tests. PTAL

carlydf (Collaborator) commented Apr 22, 2026

Hi @anujagrawal380, could you fix the linters? Would love to include this in our next release.

…blocking unversioned workers

Signed-off-by: Anuj Agrawal <anujagrawal380@gmail.com>
@anujagrawal380 anujagrawal380 force-pushed the fix/twd-leaves-stale-versioning-data branch 2 times, most recently from 22607bf to 5a2e844 Compare April 22, 2026 18:06
…ion finalizer

- Add 5-minute deletionCleanupTimeout to prevent TWD stuck in Terminating
  state indefinitely if Temporal server is unavailable
- Return errors from version/deployment deletion to trigger requeue until
  versions actually clear (pollers disappear as pods terminate)
- Add update/patch verbs and finalizers RBAC marker for TemporalConnections
- Fix comment-spacing lint on new kubebuilder:rbac markers
@anujagrawal380 anujagrawal380 force-pushed the fix/twd-leaves-stale-versioning-data branch from d6a305c to 9fd0c74 Compare April 22, 2026 18:07
anujagrawal380 (Author) commented Apr 22, 2026

> Hi @anujagrawal380, could you fix the linters? Would love to include this in our next release.

@carlydf @jaypipes Added a few more minor improvements here: 9fd0c74. PTAL!

Comment on lines +55 to +59
// deletionCleanupTimeout is the maximum duration to retry Temporal server-side
// cleanup before giving up and allowing the K8s resource to be deleted.
// This prevents the TWD from being stuck in Terminating state indefinitely
// if the Temporal server is unavailable or a version has persistent active pollers.
deletionCleanupTimeout = 5 * time.Minute
Collaborator

I think the goal of this finalizer is to keep the Kubernetes (TemporalWorkerDeployment) view and the Temporal server view aligned. So if the server-side object was created by creating the k8s-side object, the server-side object should also be deleted in the same way.

This timeout breaks that expectation. If the server is temporarily unavailable and this finalizer gives up and deletes the k8s-side object, the server-side object would never be deleted.

I'm curious if you ran into this while testing? The default TTL for "active pollers" in the server is 5 minutes, so if the TWD has active pollers when it enters the Terminating state, the controller needs to kill those Deployments and then wait 5 minutes before the "no active pollers" check passes. If all versions were Drained and the pods had been scaled down for a while, this delay wouldn't exist.

Because of that 5 minute poller TTL, a 5 minute deletionCleanupTimeout would frequently fire when a TWD is deleted before natural scale-down, which IMO would not be good (because of the leftover server-side object problem I explained above).

If we decide to keep this, I would advocate for:

  • A very long threshold, like 1h
  • Only timing out on unavailable errors from the server, not precondition failed (which is the "active pollers" thing)
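
One possible reading of these two bullets, as a sketch only: the sentinel errors, `cleanupGiveUpAfter`, and `shouldGiveUp` are hypothetical names made up for this example, and a real controller would need to classify the errors actually returned by the Temporal client rather than compare against sentinels.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical error classes. A real controller would classify the errors
// returned by the Temporal client instead of using sentinels like these.
var (
	errUnavailable        = errors.New("temporal server unavailable")
	errFailedPrecondition = errors.New("version still has active pollers")
)

// cleanupGiveUpAfter follows the "very long threshold, like 1h" suggestion.
const cleanupGiveUpAfter = time.Hour

// shouldGiveUp reports whether the finalizer should be removed without
// finishing server-side cleanup. Only unavailability counts toward the
// timeout; precondition failures always keep retrying.
func shouldGiveUp(err error, deletingFor time.Duration) bool {
	if err == nil || !errors.Is(err, errUnavailable) {
		return false
	}
	return deletingFor >= cleanupGiveUpAfter
}

func main() {
	wrapped := fmt.Errorf("delete version: %w", errFailedPrecondition)
	fmt.Println(shouldGiveUp(wrapped, 2*time.Hour))           // active pollers: keep retrying
	fmt.Println(shouldGiveUp(errUnavailable, 10*time.Minute)) // unavailable, but too early
	fmt.Println(shouldGiveUp(errUnavailable, 2*time.Hour))    // unavailable past the threshold
}
```

Here `deletingFor` is assumed to be the time elapsed since the TWD's DeletionTimestamp was set.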

Collaborator

I will defer to @jaypipes opinion here though!

Development

Successfully merging this pull request may close these issues.

Cleanup of Temporal deployments when TemporalWorker CRD is deleted

4 participants